(54) Title: ATTRIBUTE-BASED WORD MODELING

(57) Abstract: An attribute-based speech recognition system is described. A
speech preprocessor receives input speech and produces a sequence of acoustic
observations representative of the input speech. A database of context-dependent
acoustic models characterizes a probability of a given sequence of sounds
producing the sequence of acoustic observations. Each acoustic model includes
phonetic attributes and suprasegmental non-phonetic attributes. A finite state
language model characterizes a probability of a given sequence of words being
spoken. A one-pass decoder compares the sequence of acoustic observations to the
acoustic models and the language model, and outputs at least one word sequence
representative of the input speech.

Attribute-Based Word Modeling

Field of the Invention 
The invention relates to automatic speech recognition, and more
particularly, to word models used in a speech recognition system.

Background Art

Automatic speech recognition (ASR) systems do not effectively address
variations in word pronunciation. Typically, ASR dictionaries contain few
alternative pronunciations for each entry. In natural speech, however, words
rarely follow their citation forms. This failure to capture an important source
of variability can cause recognition errors, particularly in normal
conversational speech.

The automatic inference of pronunciation variation has been explored
using phonetically transcribed corpora. Unfortunately, increasing the number of
dictionary entry variants based on a pronunciation model also increases the
confusability between dictionary entries, and thus often leads to an actual
performance decrease.

Speaking mode has been considered to reduce confusability by
probabilistically weighting alternative pronunciations depending on the speaking
style. See F. Alleva, X. Huang, M.-Y. Hwang, Improvements on the Pronunciation
Prefix Tree Search Organization, Proc. Int. Conf. on Acoustics, Speech and Signal
Processing, Atlanta, GA, pp. 133-136, May 1996 (incorporated herein by
reference). This approach uses pronunciation modeling and acoustic modeling
based on a wide range of observables such as speaking rate; duration; and
syllabic, syntactic, and semantic structure, all contributing factors that are
subsumed in the notion of speaking mode. See, e.g., M. Ostendorf, B. Byrne, M.
Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin,
A. Waibel, B. Wheatley, and T. Zeppenfeld, Systematic Variations in Pronunciation
via a Language-Dependent Hidden Speaking Mode, in International Conference on
Spoken Language Processing, Philadelphia, USA, 1996 (incorporated herein by
reference).

Just as the phonetic representation of careful speech is a schematization of
articulatory and acoustic events, a phonetic transcription of relaxed informal
speech by its nature is a simplification. Pronunciation models implementing
purely phonological mappings generate phonetic transcriptions that underspecify
durational and spectral properties of speech. Reduced variants as predicted by a
pronunciation model ought to be phonetically homophonous (e.g., the fast
variant of "support" pronounced as /s/p/o/r/t/ is phonetically homophonous
with "sport"). But to create such homophony, not only should the
unstressed vowels be deleted, but the durations of the remaining phones also
should take the same values as in words not derived from fast speech vowel
reduction. Similarly, fast speech intervocalic voicing in a word like "faces"
cannot be precisely represented as /f/ey/z/ih/z/ (phonetically homophonous with
"phases") unless both the voice value of the fricative and the durational
relationship between the stressed vowel and the fricative have changed.

Brief Description of the Drawings 
The present invention will be more readily understood by reference to the
following detailed description taken with the accompanying drawings, in which:

Figure 1 illustrates a prefix search tree according to a representative
embodiment of the present invention showing roots, nodes, leaves, and single
phone word nodes (stubs).

Figure 2 illustrates the heap structure of a root node.

Figure 3 illustrates the heap structure of a leaf node.

Figure 4 illustrates the heap structure of a stub.

Detailed Description of Specific Embodiments
The foregoing suggests that capturing the complex variability of
conversational speech with purely phone-based speech recognizers is virtually
impossible. Embodiments of the present invention generalize phonetic speech
transcription to an attribute-based representation that integrates
suprasegmental non-phonetic features. A pronunciation model is trained to
augment an attribute transcription by marking possible pronunciation effects,
which are then taken into account by an acoustic model induction algorithm. A
finite state machine single-prefix-tree, one-pass, time-synchronous decoder is
used to decode highly spontaneous speech within this new representational
framework.

In representative embodiments, the notion of context is broadened from a
purely phonetic concept to one based on a set of speech attributes. The set of
attributes incorporates various features and predictors such as dialect, gender,
articulatory features (e.g., vowel, high, nasal, shifted, stress, reduced), word
or syllable position (e.g., word begin/end, syllable boundary), word class (e.g.,
pause, function word), duration, speaking rate, fundamental frequencies, HMM
state (e.g., begin/middle/end state), etc. This approach affects all levels of
modeling within the recognition engine, from the way words are represented in
the dictionary, through pronunciation modeling and duration modeling, to
acoustic modeling. This leads to strategies to efficiently decode conversational
speech within the mode dependent modeling framework.

A word is transcribed as a sequence of instances, which are
bundles of instantiated attributes (i.e., attribute-value pairs). Each attribute
can be binary, discrete (i.e., multi-valued), or continuous valued. For example,
the filled pause "um" is transcribed by a single instance ι consisting of truth
values for the following binary attributes: (pause, nasal, voiced, labial, ...).
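The instance representation described above can be sketched as a small data structure. This is only an illustrative sketch: the class name `Instance`, its `of` and `get` helpers, and the convention that an absent binary attribute reads as false are assumptions made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# An attribute value may be binary, discrete (multi-valued), or continuous.
AttributeValue = Union[bool, str, float]


@dataclass(frozen=True)
class Instance:
    """Hypothetical bundle of attribute-value pairs for one transcription unit."""

    # Stored as a sorted tuple of (name, value) pairs so instances are hashable.
    attributes: Tuple[Tuple[str, AttributeValue], ...]

    @staticmethod
    def of(**attrs: AttributeValue) -> "Instance":
        """Build an instance from keyword arguments."""
        return Instance(tuple(sorted(attrs.items())))

    def get(self, name: str, default: AttributeValue = False) -> AttributeValue:
        """Return the value of an attribute; absent attributes read as `default`."""
        return dict(self.attributes).get(name, default)


# The filled-pause example: a word transcribed by a single instance.
FILLED_PAUSE = [Instance.of(pause=True, nasal=True, voiced=True, labial=True)]

# A mixed example with discrete and continuous attributes (values are made up).
EXAMPLE = Instance.of(vowel=True, stress="primary", speaking_rate=4.2)
```

Because instances are hashable, they can be used directly as keys when looking up context-dependent models.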

The instance-based representation allows for a more detailed modeling of
pronunciation effects as observed in sloppy informal speech. Instead of
predicting an expected phonetic surface form based on a purely phonetic context,
the canonical instance-based transcription is probabilistically augmented. A
pronunciation model predicts instances for the set of attributes. Instead of
mapping from one phone sequence to another, the pronunciation model is trained
to predict pronunciation effects:

Pronunciation variants are derived by augmenting the initial transcription by the 
predicted instances: 

which are then weighted by a probability: 

1 * 

where Z is a normalizing constant. 

Predicting pronunciation variation, as described above, by augmenting the
phonetic transcription by expected pronunciation effects avoids potentially
homophonous representation of variants (see, e.g., M. Finke and A. Waibel,
Speaking Mode Dependent Pronunciation Modeling in Large Vocabulary Conversational
Speech Recognition, in Proceedings of Eurospeech-97, September 1997,
incorporated herein by reference).

The original transcription is preserved, and the duration and acoustic
model building process exploits the augmented annotation. Decision trees are
grown to induce a set of context dependent duration and acoustic models. The
induction algorithm allows for questions with respect to all attributes defined
in the transcription. Thus, starting from the augmented transcription, context
dependent modeling means that the acoustic models derived depend on the
phonetic context, pronunciation effects, and speaking mode-related attributes.
This leads to a much tighter coupling of pronunciation modeling and acoustic
modeling because model induction takes the pronunciation predictors into
account as well as acoustic evidence.

For coherence of training, testing, and rescoring results, a corresponding
LVCSR decoder should handle finite state grammar decoding, forced alignment of
training transcripts, large vocabulary statistical grammar decoding, and lattice
rescoring. One typical embodiment uses a single-prefix-tree time-synchronous
one-pass decoder that represents the underlying recognition grammar by an
abstract finite state machine. To have reasonable efficiency in a one-pass
decoder, the dictionary is represented by a pronunciation prefix tree as
described in H. Ney, R. Haeb-Umbach, B.-H. Tran, M. Oerder, "Improvements in
Beam Search for 10000-Word Continuous Speech Recognition," IEEE International
Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 9-12, 1992,
incorporated herein by reference.

Two problems can result from this representation. First, if the tree is
reentrant, then only the single best history may be considered at word
transitions at each time t. Second, the application of the grammar score is
delayed since the identity of the word is only known at the leaves of the tree.
To deal with the first problem, a priority heap may represent alternative
linguistic theories in each node of the prefix tree as described in previously
cited Alleva, Huang, and Hwang. The heap maintains all contexts whose
probabilities are within a certain threshold, thus avoiding following only the
single best local history. The threshold and the heap policy allow more or less
aggressive search techniques by effectively controlling hypothesis merging. In
contrast to the tree copying process employed by other recognizers, the heap
approach is more dynamic and scalable.
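A per-node heap of this kind might be sketched as follows. The sketch assumes that scores are costs (for instance negative log probabilities, so lower is better), that a context is any hashable key, and that the heap is bounded both by a beam relative to the best entry and by a maximum size; the class name `NodeHeap` and its methods are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


class NodeHeap:
    """Hypothetical per-node store of alternative linguistic theories."""

    def __init__(self, beam: float, max_elements: int) -> None:
        self.beam = beam                    # allowed cost distance from the best entry
        self.max_elements = max_elements    # hard bound on the number of entries
        self._best: Dict[Hashable, float] = {}

    def add(self, context: Hashable, cost: float) -> None:
        """Insert a hypothesis; hypotheses with the same context are merged."""
        old = self._best.get(context)
        if old is None or cost < old:
            self._best[context] = cost      # keep only the cheaper hypothesis

    def prune(self) -> List[Tuple[Hashable, float]]:
        """Drop entries outside the beam, then cap the heap size."""
        if not self._best:
            return []
        ranked = sorted(self._best.items(), key=lambda item: item[1])
        best_cost = ranked[0][1]
        kept = [(ctx, c) for ctx, c in ranked if c - best_cost <= self.beam]
        kept = kept[: self.max_elements]
        self._best = dict(kept)
        return kept


heap = NodeHeap(beam=5.0, max_elements=2)
heap.add("s1", 10.0)
heap.add("s2", 12.0)
heap.add("s1", 9.0)    # merged with the earlier "s1" entry, cheaper cost wins
heap.add("s3", 20.0)   # outside the beam
heap.add("s4", 13.0)   # inside the beam but cut by max_elements
survivors = heap.prune()
```

Tightening `beam` or `max_elements` makes the search more aggressive; a beam of zero with one element degenerates to following only the single best history.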

The language model is implemented in the decoder as an abstract finite
state machine. The exact nature of the underlying grammar remains transparent
to the recognizer. The only means to interact with a respective language model is
through the following set of functions, in which FSM is a finite state
machine-based language model:

FSM.initial(): Returns the initial state of the FSM.

FSM.arcs(state): Returns all arcs departing from a given state. An arc consists
     of the input label (recognized word), the output label, the cost, and the
     next state. Finite state machines are allowed to be non-deterministic,
     i.e., there can be multiple arcs with the same input label.

FSM.cost(state): Returns the exit cost for a given state to signal whether or
     not a state is a final state.

This abstraction of the language model interface makes merging of 
linguistic theories a straightforward and well-defined task to the decoder: two 
theories fall into the same congruence class of histories and can be merged if the 
state indices match. The finite state machine is designed to return which theories 
can be merged. One advantage of this division of labor is that the decoder can 
decode grammars of any order without any additional implementation effort. 
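The three-function interface and the state-based merge test can be sketched as an abstract class with one table-driven instantiation. The names `Arc`, `TableGrammar`, and `can_merge` are hypothetical, and the use of an infinite exit cost to mark non-final states is an assumption of this sketch rather than something the description specifies.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Hashable, List


@dataclass(frozen=True)
class Arc:
    input_label: str      # recognized word
    output_label: str
    cost: float
    next_state: Hashable


class FSM(ABC):
    """Abstract language model interface seen by the decoder."""

    @abstractmethod
    def initial(self) -> Hashable:
        """Return the initial state."""

    @abstractmethod
    def arcs(self, state: Hashable) -> List[Arc]:
        """Return all arcs leaving `state` (may be non-deterministic)."""

    @abstractmethod
    def cost(self, state: Hashable) -> float:
        """Return the exit cost of `state`."""


class TableGrammar(FSM):
    """Explicit finite state grammar stored as a transition table."""

    def __init__(self, start: Hashable, table: Dict[Hashable, List[Arc]],
                 finals: Dict[Hashable, float]) -> None:
        self._start, self._table, self._finals = start, table, finals

    def initial(self) -> Hashable:
        return self._start

    def arcs(self, state: Hashable) -> List[Arc]:
        return self._table.get(state, [])

    def cost(self, state: Hashable) -> float:
        # Assumption: non-final states are signalled by an infinite exit cost.
        return self._finals.get(state, math.inf)


def can_merge(state_a: Hashable, state_b: Hashable) -> bool:
    """Two theories fall into the same congruence class if their states match."""
    return state_a == state_b


grammar = TableGrammar(
    start=0,
    table={0: [Arc("call", "call", 0.0, 1)], 1: [Arc("home", "home", 0.5, 2)]},
    finals={2: 0.0},
)
```

Since the decoder only ever calls these three functions, swapping `TableGrammar` for an n-gram wrapper or a lattice would not require changes on the decoder side.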

To deal with filler words, i.e., words that are not modeled by a particular
FSM grammar (these are typically pauses such as silence and noises), the decoder
may virtually add a self loop with a given cost term to each grammar state. As a
result, any number of filler words can be accepted/recognized at each state of
the finite state machine.
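One way to picture the virtual self loops is as a wrapper around the arc function of a grammar. In this sketch an arc is a plain tuple `(input_label, output_label, cost, next_state)`, filler arcs carry an empty output label, and the helper name `add_filler_loops` is hypothetical; all three are assumptions made for illustration.

```python
from typing import Callable, Hashable, Iterable, List, Tuple

# (input_label, output_label, cost, next_state)
ArcT = Tuple[str, str, float, Hashable]


def add_filler_loops(
    arcs_fn: Callable[[Hashable], Iterable[ArcT]],
    fillers: Iterable[str],
    filler_cost: float,
) -> Callable[[Hashable], List[ArcT]]:
    """Wrap an arc function so every state also accepts filler words.

    Each filler becomes a self loop: it consumes the filler word, emits
    nothing, adds `filler_cost`, and leaves the grammar state unchanged.
    """
    filler_list = list(fillers)

    def wrapped(state: Hashable) -> List[ArcT]:
        loops = [(f, "", filler_cost, state) for f in filler_list]
        return list(arcs_fn(state)) + loops

    return wrapped


def toy_arcs(state: Hashable) -> List[ArcT]:
    """A two-word toy grammar: state 0 --yes/no--> state 1."""
    if state == 0:
        return [("yes", "yes", 0.7, 1), ("no", "no", 0.7, 1)]
    return []


arcs = add_filler_loops(toy_arcs, ["<sil>", "<noise>"], filler_cost=2.0)
```

Because the loop returns to the same state, any number of fillers can be consumed before a regular word arc is taken.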

One typical embodiment provides a set of different instantiations of the 
finite state machine interfaces that are used in different contexts of training, 
testing or rescoring a recognizer: 

• Finite State Grammar Decoding: Of course, the FSM interface may
explicitly define a finite state grammar. Besides its use in command-and-control,
this application can be used in training the recognizer. In M. Finke and A.
Waibel, Flexible Transcription Alignment, in ASRU, pages 34-40, Santa Barbara,
CA, December 1997, we showed that, when dealing with unreliable transcripts of
training data, a significant gain in word accuracy can be achieved by training
from probabilistic transcription graphs instead of the raw transcripts. Typical
embodiments allow for decoding of right recursive rule grammars by simulating
an underlying heap to deal with recursion. The transcription graphs of the
Flexible Transcription Alignment (FTA) paradigm may be expressed in the
decoder by a probabilistic rule grammar. Thus, forced alignment of training data
is basically done through decoding these utterance grammars.

• N-gram Decoding: Statistical n-gram language models are not explicitly
represented as a finite state machine. Instead, a finite state machine wrapper is
built around n-gram models. The state index codes the history such that
FSM.arcs(state) can retrieve all the language model scores required from the
underlying n-gram tables. This implies that the FSM is not minimized and the
state space is the vocabulary to the power of the order of the n-gram model.
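A wrapper of this kind can be sketched by letting the state be the word history itself. Several simplifications here are assumptions made for the example: probabilities live in a plain dictionary keyed by word tuples, unseen n-grams receive a fixed floor probability instead of a proper back-off, every state has exit cost zero, and the class name `NGramFSM` is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

History = Tuple[str, ...]
# (input_label, output_label, cost, next_state)
ArcT = Tuple[str, str, float, History]


class NGramFSM:
    """Hypothetical finite state machine wrapper around an n-gram table."""

    def __init__(self, order: int, vocab: Sequence[str],
                 probs: Dict[Tuple[str, ...], float], floor: float = 1e-6) -> None:
        self.order = order
        self.vocab = list(vocab)
        self.probs = probs      # maps (history..., word) to a probability
        self.floor = floor      # stand-in for back-off on unseen n-grams

    def initial(self) -> History:
        # The state codes the history: the last (order - 1) words.
        return ("<s>",) * (self.order - 1)

    def arcs(self, state: History) -> List[ArcT]:
        out: List[ArcT] = []
        for word in self.vocab:
            ngram = state + (word,)
            prob = self.probs.get(ngram, self.floor)
            next_state = ngram[1:]          # shift the history window by one word
            out.append((word, word, -math.log(prob), next_state))
        return out

    def cost(self, state: History) -> float:
        return 0.0                          # assumption: every state may end


trigrams = {("<s>", "<s>", "hello"): 0.5, ("<s>", "hello", "world"): 0.25}
lm = NGramFSM(order=3, vocab=["hello", "world"], probs=trigrams)
first = {arc[0]: arc for arc in lm.arcs(lm.initial())}
```

Two hypotheses that share the same history tuple are in the same state, so the decoder's state-index comparison is enough to decide whether they can be merged.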

• Lattice Rescoring: Lattices are finite state machines, too. So, rescoring a
word graph using a different set of acoustic models and a different language
model is feasible by decoding along lattices and on-the-fly composition of finite
state machines.

Grammar probabilities should be incorporated into the search process as
early as possible so that tighter pruning thresholds can be used for decoding.
Within the finite state machine abstraction, lookahead techniques can be
generalized to any kind of FSM based language model. See, e.g., S. Ortmanns, A.
Eiden, H. Ney, and N. Coenen, Look-Ahead Techniques for Fast Beam Search, in
Proceedings of the ICASSP'97, pages 1783-1786, Munich (Germany), 1997,
incorporated herein by reference. For each state, the decoder needs to derive,
on demand, a cost tree that reports for each node of the prefix tree what the
best language model score is going to be for all words with a given prefix. For a
trigram-based FSM, the lookahead tree will thus be a trigram lookahead; for
fourgrams, a fourgram lookahead; and for finite state grammars, the lookahead
will be a projection of all words allowed at a given grammar state. In order to
compute finite state machine lookahead trees efficiently on demand, several
techniques can be combined:

• Lookahead trees may be saved in an aging cache as they are computed to
avoid recomputing the tree in subsequent frames.

• The size of the cache and the number of steps to compute the tree can be
reduced by precomputing a new data structure from the prefix tree: the cost tree.
The cost tree represents the cost structure in a condensed form, and turns the
rather expensive recursive procedure of finding the best score in the tree into an
iterative algorithm.

• Each heap element, hypothesis, or tree copy has the current FSM lookahead
score attached. When a hypothesis is expanded to the next node and the
corresponding lookahead tree has been removed from the cache, the tree will not
be recomputed. Instead, the lookahead probability of the prefix is propagated
forward ("lazy cache" evaluation).
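The three techniques above can be illustrated together in a small sketch. It rests on several assumptions: scores are costs (lower is better), the prefix tree is stored as a parent array in which every parent index is smaller than its children's indices, and the aging cache is modeled as a least-recently-used cache. The names `lookahead`, `AgingCache`, and `expand_score` are hypothetical.

```python
import math
from collections import OrderedDict
from typing import Dict, Hashable, List, Optional


def lookahead(parent: List[int], word_at: Dict[int, str],
              word_cost: Dict[str, float]) -> List[float]:
    """Best word cost below every prefix-tree node, computed iteratively.

    `parent[n]` is the parent of node n (root has parent -1) and must be
    smaller than n, so one reverse sweep visits children before parents.
    """
    best = [math.inf] * len(parent)
    for node, word in word_at.items():
        best[node] = min(best[node], word_cost.get(word, math.inf))
    for node in range(len(parent) - 1, 0, -1):
        p = parent[node]
        best[p] = min(best[p], best[node])
    return best


class AgingCache:
    """Keeps the most recently used lookahead trees, evicting the oldest."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._trees: "OrderedDict[Hashable, List[float]]" = OrderedDict()

    def get(self, state: Hashable) -> Optional[List[float]]:
        tree = self._trees.get(state)
        if tree is not None:
            self._trees.move_to_end(state)      # mark as recently used
        return tree

    def put(self, state: Hashable, tree: List[float]) -> None:
        self._trees[state] = tree
        self._trees.move_to_end(state)
        while len(self._trees) > self.capacity:
            self._trees.popitem(last=False)     # drop the oldest tree


def expand_score(cache: AgingCache, state: Hashable, node: int,
                 inherited: float) -> float:
    """Lazy evaluation: reuse the inherited prefix score if the tree is gone."""
    tree = cache.get(state)
    return inherited if tree is None else tree[node]


# Tree: 0 (root) -> 1 -> {2: "sport", 3: "support"}, 0 -> 4: "tea"
PARENT = [-1, 0, 1, 1, 0]
WORDS = {2: "sport", 3: "support", 4: "tea"}
tree_a = lookahead(PARENT, WORDS, {"sport": 2.0, "support": 1.0, "tea": 3.0})
```

The reverse sweep plays the role of the iterative procedure mentioned for the cost tree: no recursion is needed once the nodes are laid out parent-first.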

Typical embodiments use polyphonic within-word acoustic models, but
triphone acoustic models across word boundaries. To incorporate crossword
modeling in a single-prefix-tree decoder, the context dependent root and leaf
nodes must be handled. Instead of having context dependent copies of the prefix
tree, each root node may be represented as a set of models, one for each possible
phonetic context. The hypotheses of these models are merged at the transition to
the within-word units (fan-in). As a compact means of representing the fan-in of
root nodes and the corresponding fan-out of leaf nodes, the notion of a
multiplexer was developed. A multiplexer is a dual map that maps instances ι to
the index of a unique hidden Markov model for the context of ι:

     $\mathrm{mpx}(\iota): \iota \mapsto i \in \{0, 1, \ldots, N\}$

     $\mathrm{mpx}[i]: i \mapsto m \in \{m_0, m_1, \ldots, m_N\}$

where $m_0, m_1, \ldots, m_N$ are unique models. The set of multiplexer models
can be precomputed based on the acoustic modeling decision tree and the
dictionary of the recognizer. Figure 1 shows the general organization of a
multiplexer-based prefix search tree showing various types of nodes including a
root node 10, internal node 12, leaf node 14, and single phone word node 16
(also called a stub).

To model conversational speech, multiplexers are particularly useful since
the augmented attribute representation of words leads to an explosion in the
number of crossword contexts. Because multiplexers map to unique model
indices, they effectively compress the fan-in/fan-out and provide a way to
address the context dependent model by the context instance ι.
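The dual-map idea can be sketched as a pair of lookup tables: one from context instance to model index and one from index to model. The class name `Multiplexer`, the `model_for` callback (standing in for a decision-tree lookup), and the toy contexts are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Iterable, List


class Multiplexer:
    """Hypothetical dual map: context instance -> index -> unique model."""

    def __init__(self, contexts: Iterable[Hashable],
                 model_for: Callable[[Hashable], Hashable]) -> None:
        self._index: Dict[Hashable, int] = {}   # context instance -> model index
        self.models: List[Hashable] = []        # model index -> unique model
        seen: Dict[Hashable, int] = {}
        for ctx in contexts:
            model = model_for(ctx)
            if model not in seen:               # store each distinct model once
                seen[model] = len(self.models)
                self.models.append(model)
            self._index[ctx] = seen[model]

    def index(self, ctx: Hashable) -> int:
        """Index of the unique model for this context."""
        return self._index[ctx]

    def model(self, idx: int) -> Hashable:
        """Model stored at this index."""
        return self.models[idx]


LABIALS = {"p", "b", "m"}


def toy_model_for(ctx: Hashable) -> str:
    """Stand-in for a decision-tree lookup: labial contexts share one model."""
    return "hmm_after_labial" if ctx in LABIALS else "hmm_default"


mpx = Multiplexer(["p", "b", "m", "a", "t"], toy_model_for)
```

Five contexts collapse onto two models here, which is the compression effect described above: hypotheses need to be kept apart only when their contexts map to different indices.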

The heap structure of a root node, 10 in Fig. 1, is shown in Figure 2. The
root node 10 represents the first attribute instance of words in terms of a
corresponding multiplexer 20. The structure also includes for each node a finite
state machine grammar state 22 and corresponding state index 24. Cost structure
26 contains a finite state machine lookahead score for the node. Score structure
28 contains a total best score of hypotheses, the acoustic score plus expected
FSM cost. Heap policy only merges hypotheses that have the same history or
linguistic theory, and whose final instances $\iota_1$ and $\iota_2$ map to the
same context dependent word initial model:
$\mathrm{mpx}(\iota_1) = \mathrm{mpx}(\iota_2)$. This means that the heap is
used to keep track of different contexts, the FSM state (representing the
linguistic context), as well as acoustic contexts. In word internal nodes 12,
only hypotheses found to be in the same FSM state are collapsed.

For every word there is a leaf node, 14 in Fig. 1, the heap structure of which 
is illustrated by Figure 3. A multiplexer describes the leaf node fan-out, and each 
heap element represents the complete fan-out for a given grammar state. 

Figure 4 illustrates the heap structure for a single-phone instance word
node or stub, 16 in Fig. 1. Words consisting of only one phone are represented by
a multiplexer of multiplexers. Depending on the left context of the word, this
stub multiplexer returns a multiplexer representing the right-context dependent
fan-out of this word. The heap policy is the same as for root nodes, and each
heap element represents the complete fan-out as for leaf nodes.

In addition to the acoustic beam and the word end beam for pruning the
acoustics, two heap related controls may also be used: (1) the maximum number
of heap elements can be bounded, and (2) there can be a beam to prune
hypotheses within a heap against each other. The number of finite state machine
states expanded at each time t can be constrained as well (topN threshold).

Acoustic model evaluation is sped up by means of Gaussian selection
through Bucket Box Intersection (BBI) and by Dynamic Frame Skipping (DFS).
With DFS, acoustic models are reevaluated only if the acoustic vector changed
significantly from time t to time t+1. A threshold on the Euclidean distance is
defined to trigger reevaluation of the acoustics. To avoid skipping too many
consecutive frames, only one skip at a time may be taken: after skipping one
frame, the next one must be evaluated.
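The frame-skipping rule can be sketched as a loop over feature vectors. The function name `score_with_frame_skipping`, the `evaluate` callback standing in for acoustic model evaluation, and the reuse of the previous frame's scores for a skipped frame are assumptions of this sketch.

```python
import math
from typing import Callable, List, Optional, Sequence, TypeVar

Scores = TypeVar("Scores")


def score_with_frame_skipping(
    frames: Sequence[Sequence[float]],
    evaluate: Callable[[Sequence[float]], Scores],
    threshold: float,
) -> List[Scores]:
    """Evaluate acoustic scores, skipping frames that barely changed.

    A frame is skipped (its scores copied from the previous frame) when its
    Euclidean distance to the previous frame is below `threshold`. At most
    one frame in a row is skipped, so the frame after a skip is always
    evaluated.
    """
    results: List[Scores] = []
    previous: Optional[Sequence[float]] = None
    skipped_previous = False
    for frame in frames:
        may_skip = (
            previous is not None
            and not skipped_previous
            and math.dist(frame, previous) < threshold
        )
        if may_skip:
            results.append(results[-1])     # reuse the previous scores
            skipped_previous = True
        else:
            results.append(evaluate(frame))
            skipped_previous = False
        previous = frame
    return results


evaluated: List[Sequence[float]] = []


def toy_evaluate(frame: Sequence[float]) -> float:
    """Stand-in for acoustic model evaluation; records which frames it saw."""
    evaluated.append(frame)
    return sum(frame)


FRAMES = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.05), (5.0, 5.0), (5.0, 5.01)]
scores = score_with_frame_skipping(FRAMES, toy_evaluate, threshold=0.5)
```

In this toy run the second and fifth frames are skipped; the third frame is evaluated even though it is close to its predecessor, because the frame before it was skipped.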

To assess the performance of a representative embodiment of the decoder
under tight realtime constraints, an evaluation test started from a Switchboard
recognizer trained on human-to-human telephone speech. The acoustic front end
computed 42 dimensional feature vectors consisting of 13 mel-frequency cepstral
coefficients plus log power and their first and second derivatives. Cepstral mean
and variance normalization as well as vocal tract length normalization were used
to compensate for channel and speaker variation. The recognizer consisted of 8000
pentaphonic Gaussian mixture models. A 15k word recognition vocabulary and
approximately 30k dictionary variants generated by a mode dependent
pronunciation model were used for decoding. Without MLLR adaptation, and
decoded with a Switchboard trigram language model trained on 3.5 million
words, the base performance at 100xRT was 37% word error rate (run-on,
one-pass recognition on NIST Eval'96). Groups participating in recent NIST
evaluations reported decoding times on the order of 300 realtime factors (which
included multiple adaptation passes).

Table 1 shows the first word accuracy results of our Switchboard
recognizer at around ten times realtime. The results show the effect of tight
pruning in the context of highly confusable Switchboard speech. TopN=10 means
that only 10 finite state machine states were expanded per frame. DFS indicates
Dynamic Frame Skipping, and BBI indicates Bucket Box Intersection.



Condition                          RT     WER (%)
Baseline                           100    37
Tight Beams, topN=10               12     43.8
Tight Beams, topN=10, DFS          7      45.6
Tight Beams, topN=10, DFS, BBI     5      49.8

Table 1



Embodiments of the invention may be implemented in any conventional
computer programming language. For example, preferred embodiments may be
implemented in a procedural programming language (e.g., "C") or an object
oriented programming language (e.g., "C++"). Alternative embodiments of the
invention may be implemented as pre-programmed hardware elements, other
related components, or as a combination of hardware and software components.

Embodiments can be implemented as a computer program product for use
with a computer system. Such implementation may include a series of computer
instructions fixed either on a tangible medium, such as a computer readable
medium (e.g., a diskette, CD-ROM, ROM, or fixed disk) or transmittable to a
computer system, via a modem or other interface device, such as a
communications adapter connected to a network over a medium. The medium
may be either a tangible medium (e.g., optical or analog communications lines) or
a medium implemented with wireless techniques (e.g., microwave, infrared or
other transmission techniques). The series of computer instructions embodies all
or part of the functionality previously described herein with respect to the
system. Those skilled in the art should appreciate that such computer
instructions can be written in a number of programming languages for use with
many computer architectures or operating systems. Furthermore, such instructions
may be stored in any memory device, such as semiconductor, magnetic, optical or
other memory devices, and may be transmitted using any communications
technology, such as optical, infrared, microwave, or other transmission
technologies. It is expected that such a computer program product may be
distributed as a removable medium with accompanying printed or electronic
documentation (e.g., shrink wrapped software), preloaded with a computer system
(e.g., on system ROM or fixed disk), or distributed from a server or electronic
bulletin board over the network (e.g., the Internet or World Wide Web). Of
course, some embodiments of the invention may be implemented as a combination
of both software (e.g., a computer program product) and hardware. Still other
embodiments of the invention are implemented as entirely hardware, or entirely
software (e.g., a computer program product).

Although various exemplary embodiments of the invention have been
disclosed, it should be apparent to those skilled in the art that various changes
and modifications can be made which will achieve some of the advantages of the
invention without departing from the true scope of the invention.

What is claimed is:



1. An attribute-based speech recognition system comprising:
     a speech pre-processor that receives input speech and produces a sequence
          of acoustic observations representative of the input speech;
     a database of context-dependent acoustic models that characterize a
          probability of a given sequence of sounds producing the sequence of
          acoustic observations, each acoustic model including phonetic
          attributes and suprasegmental non-phonetic attributes;
     a finite state language model that characterizes a probability of a given
          sequence of words being spoken; and
     a one-pass decoder that compares the sequence of acoustic observations to
          the acoustic models and the language model, and outputs at least
          one word sequence representative of the input speech.

2. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include speaking rate.

3. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include phone durations.

4. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include dialect.

5. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include gender.

6. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include fundamental frequencies.

7. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include hidden Markov model
     state.

8. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include word class.

9. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include articulatory features.

10. A speech recognition system according to claim 9, wherein the articulatory
     features include stress.

11. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes possess discrete values.

12. A speech recognition system according to claim 11, wherein the discrete
     values are binary.

13. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes possess continuous values.

14. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include syllabic structure.

15. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include syntactic structure.

16. A speech recognition system according to claim 1, wherein the
     suprasegmental non-phonetic attributes include semantic structure.

17. A speech recognition system according to claim 1, wherein the decoder
     further comprises:
     a probabilistic pronunciation model that characterizes possible
          pronunciation effects,
     wherein an acoustic model induction algorithm augments the acoustic
          models with the pronunciation model.

18. A speech recognition system according to claim 1, wherein the decoder is a
     single-prefix-tree decoder.

19. A speech recognition system according to claim 18, wherein the prefix tree
     includes nodes and leaves, and for each node and leaf a priority heap
     represents alternative linguistic theories.

20. A speech recognition system according to claim 19, wherein the heap
     maintains all contexts within a selected threshold probability.

21. A speech recognition system according to claim 19, wherein lookahead
     cost trees are used to determine for every node a best language model score
     for all words with a given prefix.

22. A speech recognition system according to claim 21, wherein lookahead
     trees are saved in an aging cache to avoid recomputing for subsequent
     frames.

23. A speech recognition system according to claim 21, wherein each heap
     element has a current lookahead score attached.

24. A speech recognition system according to claim 19, wherein the decoder
     further comprises:
     a multiplexer to represent fan-in of root nodes and fan-out of leaf nodes.

25. A speech recognition system according to claim 24, wherein the prefix tree
     includes root nodes to represent a first attribute instance of words for a
     given multiplexer.

26. A speech recognition system according to claim 24, wherein the prefix tree
     includes word internal nodes wherein the decoder collapses alternative
     hypotheses only if the alternative hypotheses are in the same finite
     machine state.

27. A speech recognition system according to claim 24, wherein each word has
     a leaf node, and the multiplexer describes the fan-out such that each heap
     element represents a complete fan-out for a given grammar state.

28. A speech recognition system according to claim 24, wherein a single phone
     word is represented by a multiplexer of multiplexers that, depending on a
     left context, returns a multiplexer representing a right context-dependent
     fan-out of the single phone word.

29. A speech recognition system according to claim 1, wherein the decoder is
     time synchronous.

30. A speech recognition system according to claim 1, wherein the decoder
     uses decision trees to induce a set of context-dependent duration and
     acoustic models.

31. A speech recognition system according to claim 1, wherein the language
     model uses n-gram models in a finite state machine wrapper.

32. A speech recognition system according to claim 1, wherein the decoder
     uses finite state recognition lattices that enable rescoring a word graph
     with alternative word models or language models.

33. A speech recognition system according to claim 1, wherein the decoder
     uses a bucket box intersection technique.

34. A speech recognition system according to claim 1, wherein the decoder
     uses a dynamic frame skipping technique.



35. An attribute-based method of speech recognition comprising:
     pre-processing input speech to produce a sequence of acoustic observations
          representative of the input speech;
     characterizing, with context-dependent acoustic models, a probability of a
          given sequence of sounds producing the sequence of acoustic
          observations, each acoustic model including phonetic attributes and
          suprasegmental non-phonetic attributes;
     characterizing, with a finite state language model, a probability of a given
          sequence of words being spoken; and
     comparing, with a one-pass decoder, the sequence of acoustic observations
          to the acoustic models and the language model, and outputting at least
          one word sequence representative of the input speech.

36. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include speaking rate.

37. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include phone durations.

38. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include dialect.

39. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include gender.

40. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include fundamental frequencies.

41. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include hidden Markov model state.

42. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include word class.

43. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include articulatory features.

44. A method according to claim 9, wherein the articulatory features include
stress.

45. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes possess discrete values.

46. A method according to claim 11, wherein the discrete values are binary.

47. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes possess continuous values.

48. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include syllabic structure.

49. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include syntactic structure.

50. A method according to claim 35, wherein the suprasegmental non-phonetic
attributes include semantic structure.

51. A method according to claim 35, wherein the comparing further comprises:
characterizing, with a probabilistic pronunciation model, possible
pronunciation effects, and
augmenting the acoustic models with the pronunciation model using an
acoustic model induction algorithm.

52. A method according to claim 35, wherein the comparing uses a
single-prefix-tree decoder.

53. A method according to claim 52, wherein the prefix tree includes nodes
and leaves, and for each node and leaf a priority heap represents
alternative linguistic theories.

54. A method according to claim 53, wherein the priority heap maintains all
contexts within a selected threshold probability.

55. A speech recognition system according to claim 53, wherein lookahead
cost trees are used to determine for every node a best language model score
for all words with a given prefix.

56. A method according to claim 55, wherein lookahead cost trees are saved in
an aging cache to avoid recomputing for subsequent frames.

57. A speech recognition system according to claim 55, wherein each heap
element has a current lookahead score attached.

58. A method according to claim 53, wherein the comparing further includes
representing, with a multiplexer, fan-in of root nodes and fan-out of leaf
nodes.

59. A method according to claim 58, wherein the prefix tree includes root
nodes to represent a first attribute instance of words for a given
multiplexer.

60. A method according to claim 58, wherein the prefix tree includes word
internal nodes in which the decoder collapses alternative hypotheses only
if the alternative hypotheses are in the same finite machine state.

61. A method according to claim 58, wherein each word has a leaf node and
the multiplexer describes the fan-out such that each heap element
represents a complete fan-out for a given grammar state.

62. A method according to claim 58, wherein a single phone word is
represented by a multiplexer of multiplexers that, depending on a left
context, returns a multiplexer representing a right context-dependent
fan-out of the single phone word.

63. A method according to claim 35, wherein the comparing includes using a
time synchronous decoder.

64. A method according to claim 35, wherein the comparing includes using
decision trees to induce a set of context-dependent duration and acoustic
models.

65. A method according to claim 35, wherein the language model uses n-gram
models in a finite state machine wrapper.

66. A method according to claim 35, wherein the comparing includes using
finite state recognition lattices that enable rescoring a word graph with
alternative word models or language models.

67. A method according to claim 35, wherein the comparing includes using a
bucket box intersection technique.

68. A method according to claim 35, wherein the comparing includes using a
dynamic frame skipping technique.
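
By way of illustration only, and without limiting the claims, the lookahead cost trees and aging cache recited in claims 55 and 56 might be sketched as follows. The sketch is not taken from the specification and rests on several assumptions: language model scores are treated as costs (for example, negative log probabilities, so lower is better), tree nodes are identified by their phone prefix, and "aging" is approximated by least-recently-used eviction. All names (`Node`, `build_prefix_tree`, `lookahead_costs`, `AgingCache`) and the toy lexicon are hypothetical.

```python
from collections import OrderedDict


class Node:
    """One node of a pronunciation prefix tree."""

    def __init__(self):
        self.children = {}  # phone -> Node
        self.words = []     # words whose pronunciation ends at this node


def build_prefix_tree(lexicon):
    """Build a prefix tree from a {word: [phone, ...]} lexicon."""
    root = Node()
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.children.setdefault(phone, Node())
        node.words.append(word)
    return root


def lookahead_costs(root, word_cost):
    """Return {phone_prefix: lowest cost of any word sharing that prefix}.

    word_cost is assumed to hold language model costs for one fixed
    language model context (lower is better).
    """
    costs = {}

    def visit(node, prefix):
        best = min((word_cost[w] for w in node.words), default=float("inf"))
        for phone, child in node.children.items():
            best = min(best, visit(child, prefix + (phone,)))
        costs[prefix] = best
        return best

    visit(root, ())
    return costs


class AgingCache:
    """Holds lookahead tables per language model context.

    Assumption: the oldest unused table is evicted first (LRU), standing in
    for whatever aging policy an actual decoder would use.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.tables = OrderedDict()
        self.misses = 0

    def get(self, context, compute):
        if context in self.tables:
            self.tables.move_to_end(context)  # mark as recently used
            return self.tables[context]
        self.misses += 1
        table = compute()
        self.tables[context] = table
        if len(self.tables) > self.capacity:
            self.tables.popitem(last=False)   # drop the oldest table
        return table


# Toy data, purely illustrative.
lexicon = {"cat": ["k", "ae", "t"], "can": ["k", "ae", "n"], "dog": ["d", "ao", "g"]}
word_cost = {"cat": 2.0, "can": 1.0, "dog": 3.0}
tree = build_prefix_tree(lexicon)
cache = AgingCache(capacity=1)
table = cache.get("context-A", lambda: lookahead_costs(tree, word_cost))
```

In such a sketch a decoder would call `cache.get` once per active language model context in each frame, so that a table computed for one frame is reused in subsequent frames instead of being recomputed.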